A method is presented for obtaining minimum discrimination
information (M.D.I.) estimates of probability distributions. This
involves using an extremal principle of Charnes and Cooper [4] and
viewing M.D.I. estimation in a dual convex programming framework.

The  resulting  dual  convex  program  is  unconstrained  and  involves  only 
exponential  and  linear  terms,  and  hence  is  easily  solved.  This  approach 
makes  M.D.I.  estimation  computationally  efficient  and  reduces  the 
time  and  cost  of  obtaining  such  estimates. 
§ 1 Introduction and Summary

Wiener (1948, p. 76) remarked quite early that entropy (or
Shannon-Wiener type measures of the amount of information) could
eventually replace Fisher's definition [9] of information (see Kullback
[11]). Still, information theory is often mistakenly considered
primarily as a subfield of communication theory, where indeed Shannon's
entropy has proved essential [14]. The statistical community has only
fairly recently (after the 1967 translation of Kullback's 1959 monograph
into Russian) been assessing the use of information theoretic concepts
in inference, although there were notable early recognitions of the
statistical power of the theory (e.g. [12], [13], [16]; see also the
references in [11]). The information functional we consider will be
called the Khinchin-Kullback-Leibler functional in honor of their early
contribution to this theory. Modern contributions of Akaike elucidate
some of this power, and show that the information theoretic framework is
perhaps the proper approach to many diverse problems of statistics. In
[1] he gives an information theoretic extension of the maximum
likelihood principle and shows that the Khinchin-Kullback-Leibler type
information functional naturally arises in statistical problems. By
utilizing the maximum relative entropy (or equivalently the minimum
expected log likelihood ratio) quantity he is able to encompass both
statistical estimation and hypothesis testing into a decision theoretic



framework  using  a Khinchin-Kullback-Leibler  type  loss  function.  His 
techniques  are  applied  to  such  important  considerations  as  the  decision 
of  the  number  of  factors  to  include  in  factor  analysis,  the  number  of 
independent  variates  to  choose  in  multiple  regression,  and  the  order 
of  the  model  when  fitting  an  autoregressive  time  series.  In  [2]  Akaike 
shows  that  using  his  extension  of  the  maximum  likelihood  method 
enables one to obtain a solution to the problem of James-Stein
estimators. In [3] he looks at Bayes procedures from an information
theoretic point of view.

The information theoretic approach is based upon the mean
information for discriminating between two densities $f_1$ and $f_2$
(relative to some fixed dominating measure $\lambda$). The mean
information for discriminating in favor of $f_1$ against $f_2$ is
defined by Kullback [11] as

$$I(f_1 : f_2) = \int f_1(x) \ln \frac{f_1(x)}{f_2(x)} \, \lambda(dx).$$

He also calls this the directed divergence between the two probability
measures and shows $I(f_1 : f_2) \ge 0$ with equality if and only if
$f_1 = f_2$ a.e. $[\lambda]$. Thus one may estimate $f_2$ by that
density $f_1$ which is closest in the sense of information distance to
$f_2$, and one may impose additional constraints upon $f_1$ when
necessary.
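The directed divergence can be illustrated numerically in the finite discrete case. The following is a minimal sketch, assuming the dominating measure is counting measure on a finite set; the function name `directed_divergence` and the two example distributions are hypothetical choices made only for illustration.

```python
import math

def directed_divergence(f1, f2):
    """I(f1 : f2) = sum_x f1(x) * ln(f1(x) / f2(x)) on a finite set.

    Terms with f1(x) == 0 contribute 0; if f1 puts mass where f2 has
    none, the divergence is infinite.
    """
    total = 0.0
    for p, q in zip(f1, f2):
        if p > 0.0:
            if q == 0.0:
                return math.inf
            total += p * math.log(p / q)
    return total

# Hypothetical example distributions on three points.
f1 = [0.5, 0.4, 0.1]
f2 = [0.25, 0.25, 0.5]

i_12 = directed_divergence(f1, f2)   # in favor of f1 against f2
i_21 = directed_divergence(f2, f1)   # in favor of f2 against f1
i_11 = directed_divergence(f1, f1)   # zero when the densities agree

print(i_12, i_21, i_11)
```

The two directions generally give different values, which is why the quantity is called a directed divergence instead of a distance in the metric sense.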

This  method  of  estimation  is  called  minimum  discrimination  information 
(M.D.I.)  and  is  based  upon  the  following  inequality: 

M.D.I. inequality: Suppose $T(x)$ is a statistic for which
$M_2(\tau) = \int e^{\tau T(x)} f_2(x) \, \lambda(dx)$ exists in an
interval, and consider those densities $f_1$ satisfying
$\theta = \int T(x) f_1(x) \, \lambda(dx)$ for a known parameter
$\theta$. Then

$$I(f_1 : f_2) \ge \theta \tau - \ln M_2(\tau), \quad \text{where } \theta = \frac{d}{d\tau} \ln M_2(\tau),$$

with equality iff

$$f_1(x) = f^*(x) = e^{\tau T(x)} f_2(x) / M_2(\tau) \quad \text{a.e. } [\lambda].$$

The density $f^*(x)$ is called the M.D.I. estimate of $f_2$ subject to
$\theta = \int T(x) f_1(x) \, \lambda(dx)$ (or the "conjugate"
distribution in Khinchin's terminology).
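The inequality can be checked on a small discrete case. The sketch below is an illustration under stated assumptions, not the authors' procedure: it takes $f_2$ uniform on the faces of a die, $T(x) = x$ and $\theta = 4.5$ (all hypothetical choices), finds $\tau$ by bisection on the tilted mean, and compares the bound with the divergence of the tilted density and of one other density having the same mean.

```python
import math

def tilt(f2, T, tau):
    """Return f*(x) = exp(tau*T(x)) * f2(x) / M2(tau) together with M2(tau)."""
    weights = [q * math.exp(tau * t) for q, t in zip(f2, T)]
    m2 = sum(weights)
    return [w / m2 for w in weights], m2

def mean_T(f, T):
    """Expected value of the statistic T under the density f."""
    return sum(p * t for p, t in zip(f, T))

def solve_tau(f2, T, theta, lo=-20.0, hi=20.0, iters=200):
    """Bisection for tau: the tilted mean d/dtau ln M2(tau) increases in tau."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid, _m2 = tilt(f2, T, mid)
        if mean_T(f_mid, T) < theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def directed_divergence(f1, f2):
    """I(f1 : f2) for discrete densities with f2 > 0 everywhere."""
    return sum(p * math.log(p / q) for p, q in zip(f1, f2) if p > 0.0)

# Hypothetical setup: f2 uniform on die faces, constraint E[T] = 4.5.
T = [1, 2, 3, 4, 5, 6]
f2 = [1.0 / 6.0] * 6
theta = 4.5

tau = solve_tau(f2, T, theta)
f_star, m2 = tilt(f2, T, tau)
bound = theta * tau - math.log(m2)          # right side of the inequality
i_star = directed_divergence(f_star, f2)    # attains the bound

# Another density with the same mean 4.5: it must not beat the bound.
other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]
i_other = directed_divergence(other, f2)

print(tau, bound, i_star, i_other)
```

Bisection is used here only because the tilted mean is monotone in $\tau$ in this one-constraint case; it is not meant to suggest how the estimate should be computed in general.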

Since solving for the M.D.I. estimate $f^*$ entails solving the
equation $\frac{d}{d\tau} \ln M_2(\tau) = \theta$ for $\tau$ (a highly
nonlinear problem), it has been quite difficult computationally to
obtain M.D.I. estimates. This implicit differential equation relation
is also difficult to work with analytically.
The  purpose  of  this  paper  is  to  show  how  to  view  M.D.I.  estimation 
from  a dual  convex  programming  point  of  view  and  to  point  out  analytical 
properties  of  the  estimates  which  follow  directly  from  the  form  of  this 
duality.  In  particular,  the  dual  problem  is  unconstrained  and  involves 
only  exponential  and  linear  terms.  This  pair  of  dual  problems  is  easily 
solved by any of a number of existing algorithms. The $\tau$ variables
needed for $f^*$ and called "dual parameters" in Gokhale and Kullback
(1978) are here shown to be actual variables in the dual convex
programming problem.

This dual formulation was first developed by Charnes and Cooper
[4], [5] (see also [7]) who applied the technique to show that the
accounting balance equations for a cartel or "resource-value transfer"
economy could be derived from Khinchin-Kullback-Leibler statistical
estimates constrained by a linear inequality system. Other new
developments showing that old heuristic estimating procedures are
actually constrained Khinchin-Kullback-Leibler estimations, such as
"gravity potential" estimates (Charnes, Raike, Bettinger [8]),
SANDDABS estimates in marketing analysis (Charnes, Cooper, Learner
[6]), and various depreciation methods in accounting (Theil and Lev
[15]), lend great weight to the idea that efficient analytical and
computational techniques for M.D.I. estimation can be valuable in
applied research.



§ 2 Unconstrained  Dual  Programming  Approach  to  Estimation 

An important application of M.D.I. estimation is to the analysis of
contingency tables, and we shall utilize this example to elucidate the
techniques of this section. Denote a contingency table cell by the
generic label $\omega$ and the collection of all cells by $\Omega$. For
a suitable choice of probability measure $\pi(\omega)$ over the
contingency table, $\omega \in \Omega$ (in general $\pi(\omega)$ is
determined by the specific problem of interest), Gokhale and Kullback
[10] pose the problem of finding that probability measure $p^*$ (the
M.D.I. estimate) which minimizes $I(p : \pi)$ subject to the equality
constraints $Cp = \theta$ and the non-negativity constraints $p \ge 0$.
Here $C$ (called the design matrix) is an $(r+1) \times |\Omega|$
matrix, $p$ is the $|\Omega| \times 1$ vector of probabilities and
$\theta$ is an $(r+1) \times 1$ vector. If we denote the rows of $C$ by
$C_i(\omega)$, $i = 0, \ldots, r$, the constraints are of the form
$\sum_{\omega} C_i(\omega) p(\omega) = \theta_i$, $i = 0, \ldots, r$.
Usually we take $\theta_0 = 1$ and $C_0(\omega) = 1$ for all
$\omega \in \Omega$ so that the first constraint ensures that $p$ is a
probability measure. The remaining $r$ equations are (usually) moment
constraints. Gokhale and Kullback additionally require that the vectors
$C_i(\cdot)$ be linearly independent in order to obtain their
estimates. As we shall see, the programming formulation does not
require this assumption. This is important since, even aside from
having to recognize linearly independent constraints, the
transformation to obtain such linearly independent constraints can be
difficult and onerous.


By using Lagrange multipliers Gokhale and Kullback show that the
minimizing probability distribution is

(2.1)  $p^*(\omega) = \exp[\tau_0 + \tau_1 C_1(\omega) + \ldots + \tau_r C_r(\omega)] \, \pi(\omega)$

where, they say, the $\tau$'s are to be determined in such a way that
$Cp^* = \theta$ holds.

The problem of determining $(\tau_0, \ldots, \tau_r)$ so as to satisfy
$Cp^* = \theta$ is, in general, quite difficult. The following
Charnes-Cooper extremal principle, however, makes this very easy
computationally.


Let $K(\delta, x) = d e^{x} - \delta x$ for $\delta \ge 0$, $d > 0$,
$x \in \mathbb{R}$ and define

$$g(\delta) = \inf_x K(\delta, x) = \delta - \delta \ln \frac{\delta}{d} = -\delta \ln \frac{\delta}{e d}.$$

Then it follows that for $\delta = (\delta_1, \ldots, \delta_n)^T$ and
$x = (x_1, \ldots, x_n)^T$:

(2.2)  $v(\delta) \equiv \sum_i g(\delta_i) = -\sum_i \delta_i \ln \frac{\delta_i}{e d_i} \le \sum_i [d_i e^{x_i} - \delta_i x_i].$

Suppose that $x = A^T z$ for some matrix $A$. Let ${}_iA^T$ denote the
$i$th row of $A^T$, and set $A\delta = b$. Then (2.2) becomes

(2.3)  $v(\delta) = -\sum_i \delta_i \ln \frac{\delta_i}{e d_i} \le \sum_i [d_i e^{{}_iA^T z} - \delta_i \, {}_iA^T z] = \sum_i d_i e^{{}_iA^T z} - b^T z \equiv \xi(z)$

which holds for all $z$ and $\delta \in \{\delta : A\delta = b, \ \delta \ge 0\}$.
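The scalar step behind (2.2) can be verified numerically. The sketch below is only an illustration: the values of `d` and `delta` are hypothetical, and the code checks that $K(\delta, x)$ equals the closed form $g(\delta)$ at $x = \ln(\delta/d)$ and exceeds it elsewhere.

```python
import math

def K(delta, x, d):
    """K(delta, x) = d * exp(x) - delta * x."""
    return d * math.exp(x) - delta * x

def g(delta, d):
    """Closed form of inf_x K(delta, x) = -delta * ln(delta / (e * d)).

    For delta == 0 the infimum of d * exp(x) is 0 (approached as x -> -inf).
    """
    if delta == 0.0:
        return 0.0
    return -delta * math.log(delta / (math.e * d))

# Hypothetical values chosen only for the check.
d, delta = 0.5, 1.3

# Setting dK/dx = d*exp(x) - delta = 0 gives the minimizer x = ln(delta/d).
x_min = math.log(delta / d)
gap_at_min = K(delta, x_min, d) - g(delta, d)

# Away from the minimizer, K must lie strictly above g(delta).
gaps_elsewhere = [K(delta, x_min + h, d) - g(delta, d)
                  for h in (-2.0, -0.5, 0.5, 2.0)]

print(gap_at_min, gaps_elsewhere)
```

Summing this scalar inequality over the components $i$ and substituting $x = A^T z$ gives the weak duality relation $v(\delta) \le \xi(z)$ of (2.3).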


We then have:

Theorem 2.1 (Charnes-Cooper [4], [5]). For the following dual convex
programs

(I)   $\sup v(\delta) = -\sum_i \delta_i \ln \frac{\delta_i}{e d_i}$  subject to  $A\delta = b$, $\delta \ge 0$

and

(II)  $\inf \xi(z) \equiv \sum_i d_i e^{{}_iA^T z} - b^T z$

there are exactly three mutually exclusive and collectively exhaustive
duality states:

(1)  $\Delta \equiv \{\delta : A\delta = b, \ \delta \ge 0\} = \emptyset$ and $\xi(z)$ is unbounded below.

(2)  $\prod_i \delta_i = 0$ for all $\delta \in \Delta \ne \emptyset$ and $\xi(z)$ has only an infimum.
     Further, $\inf \xi(z) = \min \xi_D(z)$, where $\xi_D(z)$ contains only those
     terms of $\xi(z)$ for which $\delta_i > 0$ in some $\delta \in \Delta$.

(3)  There exists $\delta \in \Delta$ with $\delta > 0$ and $\xi(z)$ has a minimum at $z^*$.
     Further:  (a)  $v(\delta)$ has a unique maximum at $\delta^* > 0$
               (b)  $v(\delta^*) = \xi(z^*)$
               (c)  $\delta_i^* = d_i e^{{}_iA^T z^*}$
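State (3) of Theorem 2.1 can be illustrated on a small contingency table. The sketch below rests on several assumptions that are not in the text: a hypothetical 2 x 2 table with uniform $\pi$, marginal targets 0.7 and 0.4, the choice $d_i = \pi_i / e$ (which makes $v(\delta) = -I(\delta : \pi)$), and plain gradient descent with backtracking as the unconstrained solver. Any other unconstrained convex minimizer could be substituted.

```python
import math

def delta_of_z(z, A, d):
    """Item (c): delta_i = d_i * exp(a_i . z), a_i being the i-th column of A."""
    return [d[i] * math.exp(sum(A[k][i] * z[k] for k in range(len(A))))
            for i in range(len(d))]

def xi(z, A, d, b):
    """Dual objective (II): sum_i d_i * exp(a_i . z) - b . z."""
    return sum(delta_of_z(z, A, d)) - sum(bk * zk for bk, zk in zip(b, z))

def grad_xi(z, A, d, b):
    """Gradient of xi is A * delta(z) - b; it vanishes when A delta = b."""
    delta = delta_of_z(z, A, d)
    return [sum(A[k][i] * delta[i] for i in range(len(d))) - b[k]
            for k in range(len(b))]

def solve_dual(A, d, b, tol=1e-6, max_iter=20000):
    """Minimize xi by gradient descent with Armijo backtracking."""
    z = [0.0] * len(b)
    for _ in range(max_iter):
        grad = grad_xi(z, A, d, b)
        g2 = sum(gk * gk for gk in grad)
        if math.sqrt(g2) < tol:
            break
        step, f0 = 1.0, xi(z, A, d, b)
        while step > 1e-12:
            trial = [zk - step * gk for zk, gk in zip(z, grad)]
            if xi(trial, A, d, b) <= f0 - 0.5 * step * g2:
                z = trial
                break
            step *= 0.5
        else:
            break  # no acceptable step found; stop with the current z
    return z

# Hypothetical 2 x 2 table, cells ordered (0,0), (0,1), (1,0), (1,1).
pi = [0.25, 0.25, 0.25, 0.25]
d = [p / math.e for p in pi]      # assumption: makes v(delta) = -I(delta : pi)
A = [[1, 1, 1, 1],                # total probability = 1
     [1, 1, 0, 0],                # row-0 marginal   = 0.7
     [1, 0, 1, 0]]                # column-0 marginal = 0.4
b = [1.0, 0.7, 0.4]

z_star = solve_dual(A, d, b)
p_star = delta_of_z(z_star, A, d)
v_star = -sum(p * math.log(p / (math.e * di)) for p, di in zip(p_star, d))
xi_star = xi(z_star, A, d, b)

# Adding a redundant (linearly dependent) constraint: row-1 marginal = 0.3.
A_dep = A + [[0, 0, 1, 1]]
b_dep = b + [0.3]
p_dep = delta_of_z(solve_dual(A_dep, d, b_dep), A_dep, d)

print(p_star, v_star, xi_star, p_dep)
```

With uniform $\pi$ the constrained minimizer is the product of the two marginals, so `p_star` should be close to (0.28, 0.42, 0.12, 0.18). The second run adds a linearly dependent row to show that the dual minimization does not rely on independent constraints.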


Note that there is no requirement of linear independence made, and
all possible behaviors of the system $\Delta$ are comprehended. Of
course, the usual state in applications will be (3). If the requisites
for state determination are not obvious, the state may be determined by
solution of the linear programming problem:

    max  $\mu$
    subject to  $\mu e - \delta \le 0$
                $A\delta = b$
                $\delta \ge 0$

where $e$ here denotes the vector of ones. State (1) corresponds to
infeasibility, state (2) corresponds to $\mu^* = 0$ and state (3)
corresponds to $\mu^* > 0$ (cf. [7]).




This  result  is  very  attractive  since  the  dual  problem  (II)  is  an 
unconstrained  convex  programming  problem  involving  only  exponential 
and  linear  terms  and  hence  is  easily  solved  numerically.  The  original 
constrained  Khinchin-Kullback-Leibler  estimation  problem  (I)  was  very 
difficult to solve. It should also be noted that case 3 of Theorem 2.1
could equally well have been proven in the general measure-theoretic
case in almost the same manner as the finite discrete case [7], [17].

Referring back to (2.1) and the determination of
$(\tau_0, \ldots, \tau_r)$, we note that taking $\delta = p(\omega)$,
$\omega \in \Omega$, $d = e^{-1} \pi(\omega)$, $A = C$ and $b = \theta$
in Theorem 2.1 yields the Gokhale-Kullback M.D.I. problem mentioned
[...] distributions in [7], a similar property persists there as well.
Thus the M.D.I. estimated distributions have the attractive property of
preserving sufficient statistics associated with the target
distribution since its density is multiplied by an exponential linear
in z to get the M.D.I. estimate.

Also, as Gokhale and Kullback [10] noted, when fitting marginals for
contingency tables by M.D.I. estimation one obtains the log-linear
model for contingency table entries as a by-product. Here this is
immediately evident (without further qualifications or Gokhale and
Kullback's additional requirements) by taking logs in item (c) of
state (3).
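The step of taking logs can be written out as follows. This is a short worked sketch, assuming the identification $\delta = p$, $d = e^{-1}\pi$, $A = C$, $z = \tau$ in Theorem 2.1, so that ${}_iA^T z$ for cell $\omega$ becomes $\sum_i \tau_i C_i(\omega)$; the starred symbols denote the optimal values.

```latex
\begin{align*}
p^*(\omega) &= e^{-1}\,\pi(\omega)\,
               \exp\Big[\sum_{i=0}^{r} \tau_i^* C_i(\omega)\Big]
  && \text{item (c) of state (3)} \\
\ln p^*(\omega) &= \ln \pi(\omega) - 1 + \sum_{i=0}^{r} \tau_i^* C_i(\omega)
  && \text{taking logarithms} \\
\ln \frac{p^*(\omega)}{\pi(\omega)} &= (\tau_0^* - 1)
   + \sum_{i=1}^{r} \tau_i^* C_i(\omega)
  && \text{using } C_0(\omega) = 1
\end{align*}
```

The last line is linear in the design functions $C_i(\omega)$, which is the log-linear form; the constant $-1$ is absorbed into the intercept, so the result agrees with (2.1) up to a relabeling of $\tau_0$.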

As  alluded  to  in  section  one,  this  dual  approach  to  constrained 
Khinchin-Kullback-Leibler  estimation  has  already  proven  valuable  in 
practice. In marketing research, Charnes, Cooper and Learner [6]
substantially extended the SANDDABS analysis used to evaluate consumer
purchase behavior and brand shifting. They showed that the calculations
involved in the analysis could be viewed as constrained
Khinchin-Kullback-Leibler estimation, and thereby brought the procedure
under the ambit of statistical theory rather than leaving it as merely
a heuristic estimation
method.  By  utilizing  the  dual  approach  outlined  in  Theorem  2.  1 they 
were  able  to  improve  the  computational  efficiency  of  the  analysis  as  well. 